Reading and Writing files using Pandas

Pandas is one of the most popular Python libraries. It provides a user-friendly interface for reading, displaying and writing files, and it also supports plotting, time series analysis, missing value handling and more.


In [1]:
import pandas as pd

Part 1: reading a .csv file


In [2]:
data_csv = pd.read_csv("titanic.csv")

In [4]:
data_csv.head()


Out[4]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
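
The NaN entries in the Cabin column above mark missing values. As a minimal sketch of how read_csv() treats quoted text and empty fields, the example below reads a small in-memory sample instead of titanic.csv. The sample rows are a shortened, hypothetical subset written for illustration, and the code assumes Python 3.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like a subset of titanic.csv columns.
csv_text = """PassengerId,Survived,Name,Age,Cabin
1,0,"Braund, Mr. Owen Harris",22,
2,1,"Cumings, Mrs. John Bradley",38,C85
3,1,"Heikkinen, Miss. Laina",26,
"""

# io.StringIO lets read_csv() treat the string as if it were a file.
sample = pd.read_csv(io.StringIO(csv_text))

# Quoted names keep their commas; empty Cabin fields are read as NaN.
print(sample.head())

# Count the missing values in each column.
print(sample.isna().sum())
```

The same calls should work on data_csv once titanic.csv has been loaded.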

Part 2: reading .txt files

CSV stands for Comma Separated Values, as the values/variables in .csv files are separated by commas. Values in .txt files, in contrast, can be separated by any character, most often a tab ("\t") or a space. A file that uses tabs is often called a tab-separated file. To read .txt files in pandas we again use the same read_csv() function, yet this time we pass another argument besides the name of the file: the separator, sep. Below the file is split on single spaces (sep=" "); for a tab-separated file you would pass sep="\t" instead.


In [6]:
data_txt = pd.read_csv("imagine_lyrics.txt", sep=" ")

In [7]:
data_txt.head()


Out[7]:
Imagine by John LennonImagine all the people, Unnamed: 7
0 living life in peace... NaN NaN NaN NaN
1 \tJohn Lennon NaN NaN NaN NaN NaN NaN
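
The lyrics file above is free text, so it does not split into tidy columns. As a rough sketch of how the sep argument behaves on data that really is tab-separated, the example below reads an in-memory string. The content of tsv_text is a hypothetical sample written for illustration, and the code assumes Python 3.

```python
import io

import pandas as pd

# Hypothetical tab-separated content; "\t" inside the string is a tab character.
tsv_text = (
    "title\tartist\tyear\n"
    "Imagine\tJohn Lennon\t1971\n"
    "Yesterday\tThe Beatles\t1965\n"
)

# With sep="\t" every tab starts a new column.
songs = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(songs)

# With the default comma separator each line stays in a single column.
one_column = pd.read_csv(io.StringIO(tsv_text))
print(one_column.shape)
```

Comparing the two shapes is a quick way to check whether the separator matches the file.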

Part 3: reading .html files

Pandas also has a read_html() function, similar to read_csv(), which reads tables from HTML files. All of these functions can also read files directly from a URL. Let's use the URL of careercenter to read the page content provided in HTML.


In [9]:
data_html = pd.read_html("https://careercenter.am/")


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-495f8adc6296> in <module>()
----> 1 data_html = pd.read_html("https://careercenter.am/")

C:\Program Files\Anaconda2\lib\site-packages\pandas\io\html.pyc in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
    894                   thousands=thousands, attrs=attrs, encoding=encoding,
    895                   decimal=decimal, converters=converters, na_values=na_values,
--> 896                   keep_default_na=keep_default_na)

C:\Program Files\Anaconda2\lib\site-packages\pandas\io\html.pyc in _parse(flavor, io, match, attrs, encoding, **kwargs)
    731             break
    732     else:
--> 733         raise_with_traceback(retained)
    734 
    735     ret = []

C:\Program Files\Anaconda2\lib\site-packages\pandas\io\html.pyc in _parse(flavor, io, match, attrs, encoding, **kwargs)
    725 
    726         try:
--> 727             tables = p.parse_tables()
    728         except Exception as caught:
    729             retained = caught

C:\Program Files\Anaconda2\lib\site-packages\pandas\io\html.pyc in parse_tables(self)
    194 
    195     def parse_tables(self):
--> 196         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    197         return (self._build_table(table) for table in tables)
    198 

C:\Program Files\Anaconda2\lib\site-packages\pandas\io\html.pyc in _parse_tables(self, doc, match, attrs)
    424 
    425         if not tables:
--> 426             raise ValueError('No tables found')
    427 
    428         result = []

ValueError: No tables found

As you can see, we receive an error here. The problem is that the read_html() function reads only HTML tables, and no table could be found on the careercenter webpage. If you check the source of their website, you will see that the page itself holds no content. The content is generated through another file called ccidxann.php. This means we should copy the link to that file and scrape it instead.
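
The behaviour described above can be reproduced offline. The sketch below is only an illustration: both HTML strings and the helper name tables_in are hypothetical, and the code assumes Python 3 with an HTML parser that pandas can use (lxml, or bs4 with html5lib) installed.

```python
import io

import pandas as pd

# Hypothetical page contents; neither string is taken from careercenter.am.
PAGE_WITH_TABLE = """
<html><body>
<table>
  <tr><th>Title</th><th>Company</th></tr>
  <tr><td>Data Analyst</td><td>Example LLC</td></tr>
</table>
</body></html>
"""
PAGE_WITHOUT_TABLE = "<html><body><p>Content is loaded from another file.</p></body></html>"


def tables_in(html):
    """Return the list of DataFrames read_html() finds, or [] if there are none."""
    try:
        # read_html() returns one DataFrame per <table> element it can parse.
        return pd.read_html(io.StringIO(html))
    except ValueError:
        # pandas raises ValueError when the document contains no tables.
        return []
```

Calling tables_in(PAGE_WITH_TABLE) should give a list with one DataFrame, while tables_in(PAGE_WITHOUT_TABLE) should give an empty list instead of the traceback shown above.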


In [11]:
data_html = pd.read_html("https://careercenter.am/ccidxann.php")

In [12]:
data_html.head()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-035a0e7d72b1> in <module>()
----> 1 data_html.head()

AttributeError: 'list' object has no attribute 'head'

This time head() cannot be used: read_html() returns a list of dataframes (one per table found on the page) rather than a single dataframe, and a list has no head() method. So let's just print it.


In [13]:
print(data_html)


[                     0                                                  1
0    JOB OPPORTUNITIES                                                NaN
1                  NaN                     Chief Accountant / Noyan Tapan
2                  NaN  Leading Loan Specialist of Microcredit Block i...
3                  NaN                Senior Internal Auditor / FINCA UCO
4                  NaN                     Credit Officer / Prometey Bank
5                  NaN  Director / Civic Development and Partnership F...
6                  NaN                  Finance Director / Reso Insurance
7                  NaN  FTTB, ADSL/ VDSL Networks Monitoring Technical...
8                  NaN               Digital Platforms Manager / ArmenTel
9                  NaN                           Consultant/ Seller / TST
10                 NaN      Operations Research Developer / Optym Armenia
11                 NaN  Product Manager / Berlin-Chemie Armenian Repre...
12                 NaN               Policy Analyst / UNDP Armenia Office
13                 NaN                           Front-End Developer / 4H
14                 NaN  Specialist of Reconciliation Division / ArmSwi...
15                 NaN  Specialist of Loans Processing and Reporting D...
16                 NaN                      Accountant / Zeppelin Armenia
17                 NaN               Head of Digital Banking / Ameriabank
18                 NaN                                Data Analyst / IPSC
19                 NaN  Account Manager, Client Service Department / M...
20                 NaN     Digital Marketing Specialist / McCann Erickson
21                 NaN  Medical Representative/ Medical Equipment Spec...
22                 NaN  Head of Finance Management/ Chief Accountant /...
23                 NaN             Mobile UI/ UX Designer / Prometey Bank
24                 NaN                        Receptionist / Envoy Hostel
25                 NaN  Consultant on Cost Benefit Analysis of Alterna...
26                 NaN              Digital Innovations Specialist / Ucom
27                 NaN                    Graphic Designer / Baldi Retail
28                 NaN  Head of Operational Risk Assessment and Monito...
29                 NaN  Head of Operational Risk Management Department...
..                 ...                                                ...
119                NaN             Senior JavaScript Developer / Digitain
120                NaN                         JavaScript Developer / SFL
121                NaN                    Procurement Manager / Telia-Med
122                NaN         User Behavior Research Scientist / PicsArt
123                NaN         Deep Learning Research Scientist / PicsArt
124                NaN               Senior Software Developer / XNTrends
125                NaN                       UI/ UX Designer / IUNetworks
126                NaN             Engineering Director / Ginosi Apartels
127                NaN          Senior Systems Engineer / Ginosi Apartels
128                NaN         Senior Android Developer / Ginosi Apartels
129                NaN               Senior Java Developer / EPAM Systems
130                NaN              Digital Marketing Specialist / Lesona
131                NaN             Rental Agent for "Sixt" Armenia / Fora
132                NaN                  IT Project Coordinator / Altacode
133                NaN  Head of Technical Production Department / Doro...
134                NaN  Application Engineer, Place and Route Departme...
135                NaN  Software Engineer / Mentor Graphics Developmen...
136                NaN                Director of Engineering / Workfront
137                NaN                 CNC Machine Operator / Carrara Rus
138                NaN                          Storekeeper / Carrara Rus
139                NaN                           Accountant / Carrara Rus
140                NaN                        Technician/ Installer / TST
141                NaN                           Accounting Manager / TST
142                NaN                       IT Specialist / ArmSwissBank
143                NaN             Road Construction Engineer / Dorozhnik
144                NaN  Digital Marketing Specialist / Andava Digital ...
145                NaN       Stand Customer Service Specialist / Varks.am
146                NaN               Doctor Expert / Rosgosstrakh-Armenia
147                NaN                       WordPress Developer / Reload
148                NaN                   Mechanical Engineer / Imex Group

[149 rows x 2 columns],              0                                          1
0  INTERNSHIPS                                        NaN
1          NaN          Branch Intern / HSBC Bank Armenia
2          NaN  Contact Center Intern / HSBC Bank Armenia,            0                                         1
0  TRAININGS                                       NaN
1        NaN  English Language Courses / Career Center,               0                                                  1
0  COMPETITIONS                                                NaN
1           NaN  Invitation to Bid - ITB/ARM/01/2017 - Sale of ...
2           NaN  Call for Designing Companies for SMEDA Project...]

We may check the length of the list to see how many elements it has. Each element is one separate table.


In [14]:
len(data_html)


Out[14]:
4
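
To make the list-of-tables idea concrete without network access, the sketch below builds a small list by hand as a stand-in for what read_html() returns. The table contents are hypothetical and much shorter than the real data.

```python
import pandas as pd

# Hand-built stand-in for the list returned by read_html(): one DataFrame per table.
tables = [
    pd.DataFrame({0: ["JOB OPPORTUNITIES", None], 1: [None, "Data Analyst / Example LLC"]}),
    pd.DataFrame({0: ["INTERNSHIPS", None], 1: [None, "Intern / Example Bank"]}),
]

# A plain Python list has no head() method, but each element does.
print(hasattr(tables, "head"))
print(hasattr(tables[0], "head"))

# Walk over the list to see which table sits at which position.
for position, table in enumerate(tables):
    # In this layout the table label is stored in row 0 of column 0.
    print(position, table.iloc[0, 0], table.shape)
```

A loop like this is a quick way to decide which index to pick before selecting one table.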

In [15]:
data_html[0]


Out[15]:
                     0                                                  1
0    JOB OPPORTUNITIES                                                NaN
1                  NaN                     Chief Accountant / Noyan Tapan
2                  NaN  Leading Loan Specialist of Microcredit Block i...
3                  NaN                Senior Internal Auditor / FINCA UCO
4                  NaN                     Credit Officer / Prometey Bank
5                  NaN  Director / Civic Development and Partnership F...
6                  NaN                  Finance Director / Reso Insurance
7                  NaN  FTTB, ADSL/ VDSL Networks Monitoring Technical...
8                  NaN               Digital Platforms Manager / ArmenTel
9                  NaN                           Consultant/ Seller / TST
10                 NaN      Operations Research Developer / Optym Armenia
11                 NaN  Product Manager / Berlin-Chemie Armenian Repre...
12                 NaN               Policy Analyst / UNDP Armenia Office
13                 NaN                           Front-End Developer / 4H
14                 NaN  Specialist of Reconciliation Division / ArmSwi...
15                 NaN  Specialist of Loans Processing and Reporting D...
16                 NaN                      Accountant / Zeppelin Armenia
17                 NaN               Head of Digital Banking / Ameriabank
18                 NaN                                Data Analyst / IPSC
19                 NaN  Account Manager, Client Service Department / M...
20                 NaN     Digital Marketing Specialist / McCann Erickson
21                 NaN  Medical Representative/ Medical Equipment Spec...
22                 NaN  Head of Finance Management/ Chief Accountant /...
23                 NaN             Mobile UI/ UX Designer / Prometey Bank
24                 NaN                        Receptionist / Envoy Hostel
25                 NaN  Consultant on Cost Benefit Analysis of Alterna...
26                 NaN              Digital Innovations Specialist / Ucom
27                 NaN                    Graphic Designer / Baldi Retail
28                 NaN  Head of Operational Risk Assessment and Monito...
29                 NaN  Head of Operational Risk Management Department...
..                 ...                                                ...
119                NaN             Senior JavaScript Developer / Digitain
120                NaN                         JavaScript Developer / SFL
121                NaN                    Procurement Manager / Telia-Med
122                NaN         User Behavior Research Scientist / PicsArt
123                NaN         Deep Learning Research Scientist / PicsArt
124                NaN               Senior Software Developer / XNTrends
125                NaN                       UI/ UX Designer / IUNetworks
126                NaN             Engineering Director / Ginosi Apartels
127                NaN          Senior Systems Engineer / Ginosi Apartels
128                NaN         Senior Android Developer / Ginosi Apartels
129                NaN               Senior Java Developer / EPAM Systems
130                NaN              Digital Marketing Specialist / Lesona
131                NaN             Rental Agent for "Sixt" Armenia / Fora
132                NaN                  IT Project Coordinator / Altacode
133                NaN  Head of Technical Production Department / Doro...
134                NaN  Application Engineer, Place and Route Departme...
135                NaN  Software Engineer / Mentor Graphics Developmen...
136                NaN                Director of Engineering / Workfront
137                NaN                 CNC Machine Operator / Carrara Rus
138                NaN                          Storekeeper / Carrara Rus
139                NaN                           Accountant / Carrara Rus
140                NaN                        Technician/ Installer / TST
141                NaN                           Accounting Manager / TST
142                NaN                       IT Specialist / ArmSwissBank
143                NaN             Road Construction Engineer / Dorozhnik
144                NaN  Digital Marketing Specialist / Andava Digital ...
145                NaN       Stand Customer Service Specialist / Varks.am
146                NaN               Doctor Expert / Rosgosstrakh-Armenia
147                NaN                       WordPress Developer / Reload
148                NaN                   Mechanical Engineer / Imex Group

149 rows × 2 columns


In [16]:
data_html[1]


Out[16]:
             0                                          1
0  INTERNSHIPS                                        NaN
1          NaN          Branch Intern / HSBC Bank Armenia
2          NaN  Contact Center Intern / HSBC Bank Armenia

In [17]:
data_html[2]


Out[17]:
           0                                         1
0  TRAININGS                                       NaN
1        NaN  English Language Courses / Career Center

In [18]:
data_html[3]


Out[18]:
              0                                                  1
0  COMPETITIONS                                                NaN
1           NaN  Invitation to Bid - ITB/ARM/01/2017 - Sale of ...
2           NaN  Call for Designing Companies for SMEDA Project...

Let's take only the job postings table, which has 2 columns like all the others. Apart from the heading in its first row, the first column holds only NaN values, so we will choose only the second one and save it as our data for analysis.


In [19]:
data = data_html[0][1]

Selecting a single column of a dataframe gives a pandas Series rather than a list, so head() and other pandas functions can be used again.


In [20]:
data.head()


Out[20]:
0                                                  NaN
1                       Chief Accountant / Noyan Tapan
2    Leading Loan Specialist of Microcredit Block i...
3                  Senior Internal Auditor / FINCA UCO
4                       Credit Officer / Prometey Bank
Name: 1, dtype: object
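
The "Name: 1, dtype: object" line in the output shows that the selected column is a Series. The sketch below illustrates this on a small hypothetical table shaped like data_html[0], and shows one possible way (an assumption, not part of the original workflow) to drop the empty first row.

```python
import pandas as pd

# Hypothetical two-column table shaped like data_html[0].
table = pd.DataFrame({
    0: ["JOB OPPORTUNITIES", None, None],
    1: [None, "Chief Accountant / Example Co", "Data Analyst / Example LLC"],
})

# Selecting one column by its label returns a Series.
postings = table[1]
print(type(postings).__name__)

# Drop the missing first row and renumber the remaining entries from 0.
clean = postings.dropna().reset_index(drop=True)
print(clean.head())

# Convert back to a one-column DataFrame if a table is needed later.
as_frame = clean.to_frame(name="posting")
print(as_frame.shape)
```

The column name "posting" is an arbitrary choice made for this illustration.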

Part 4: reading other files

Pandas also has functions for reading Excel, Stata, SAS, JSON, SQL and other files. You may check the official documentation for details.
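
As one small illustration of these other readers, the sketch below uses read_json() on an in-memory string. The JSON content is a hypothetical sample, and the code assumes Python 3. Readers such as read_excel() follow the same pattern but may need extra packages installed.

```python
import io

import pandas as pd

# Hypothetical JSON content: a list of records, one per job posting.
json_text = (
    '[{"title": "Data Analyst", "company": "Example LLC"},'
    ' {"title": "Accountant", "company": "Example Co"}]'
)

# Each record becomes a row and each key becomes a column.
jobs = pd.read_json(io.StringIO(json_text))
print(jobs)
```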

Part 5: writing to files

Writing in Pandas is as easy as reading. You just need to use another function, to_csv() (in the case of CSV files). Let's take a look at it.


In [21]:
data.to_csv("careercenter_data.csv")

We may now go to our folder to check the CSV file. Note that to_csv() also writes the index as the first column unless index=False is passed.
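
A quick way to check what to_csv() produces without opening the folder is to write into an in-memory buffer and read the text back. The sketch below does that with hypothetical postings. The use of io.StringIO and the index=False and header=True arguments are choices made for this illustration.

```python
import io

import pandas as pd

# Hypothetical postings standing in for the scraped data.
postings = pd.Series(
    ["Chief Accountant / Example Co", "Data Analyst / Example LLC"],
    name="posting",
)

buffer = io.StringIO()
# index=False leaves out the row labels; header=True writes the column name.
postings.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that the values survive the round trip.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored["posting"].tolist() == postings.tolist())
```

Passing a file name such as "careercenter_data.csv" instead of the buffer writes the same text to disk.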